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Abstract 


An approach is documented for analyzing probabilistic properties of fault-tolerant distributed systems using 
the PRISM model checker. 


1 


Contents 


1 Introduction 3 

2 PRISM Overview 4 

2.1 Modules 4 

2.2 Property Specification 6 

2.3 The PRISM Tools 8 

2.4 Other Features of PRISM 10 

3 Probabilistic Analysis of SPIDER Using PRISM 10 

3.1 SPIDER 10 

3.2 Maximal Fault Assumption 11 

3.3 Reliability Models in PRISM 12 

3.3.1 Baseline Model 12 

3.3.2 Properties 14 

3.3.3 Analysis Results 15 

3.3.4 Revised Model 17 

3.4 Protocol Models in PRISM 18 

3.4.1 SPIDER’s Broadcast Service 21 

3.4.2 PRISM Model 21 

3.4.3 Protocol Properties 24 

3.4.4 Analysis Results 25 

4 Conclusion 25 

A Reliability Models of SPIDER 29 

A. 1 Baseline Model 29 

A. 2 Revised Model 34 

B Model of SPIDER’s Broadcast Service 39 


2 


1 Introduction 


A method for probabilistic analysis of fault-tolerant distributed systems using a probabilistic model checker 
is presented. This description includes several example analyses based on the Scalable Processor- Independent 
Design for Enhanced Reliability (SPIDER) architecture. 

A long history exists in applying probabilistic models to estimate the reliability of redundant systems. 
Methods based on Markov chain analysis are prevalent in this context as they enable modeling of state- 
dependent stochastic phenomena, which is necessary to represent dynamic system features such as reconfig- 
uration and component repair. The general principles behind these methods, and many examples of Markov 
(and semi-Markov) models of fault-tolerant systems, are presented by Butler and Johnson in [5]. 

Different software tools have been developed to solve the Markov or semi-Markov chains encountered 
when modeling fault-tolerant systems. Geist and Trivedi survey several tools available in the early 1990s and 
present the solution techniques that they employ [10]. All these tools attempt to address two fundamental 
issues. First, Markov models of fault-tolerant systems can have very large state spaces. This large size is 
caused both by the number of components in such systems and by the complex reconfiguration and diagnosis 
strategies fault tolerance often requires. Second, the transition rates encountered in the Markov models can 
vary widely in magnitude — since fault-arrival rates are slow but system reconfiguration is fast — which leads 
to difficulties for numerical algorithms. The Semi-Markov Unreliability Range Evaluator (SURE) [4] is 
dedicated to the analysis of fault-tolerant systems that exhibit low fault rates and fast reconfiguration. It 
relies on approximation theorems to give lower and upper bounds on system reliability. The Symbolic 
Hierarchical Automated Reliability and Performance Evaluator (SHARPE) [27] uses hierarchical modeling 
to mitigate the state-explosion problem. Simulation-based methods that rely on importance sampling have 
also been proposed [9, 30]. 

More recent tools such as Mobius [8] or the Symbolic Model checking Analyzer for Reliability and 
Timing (SMART) [6] are general frameworks for assessing reliability, performance, and other properties of 
complex computer systems. They employ sophisticated techniques for compactly representing the model 
state space and they use decomposition methods for reducing the analysis cost. These tools are generic and 
have been applied to a variety of computer systems and networks. Example applications to fault-tolerant 
systems include [3] and [31]. A more complete survey of recent modeling formalisms and analysis tools can 
be found in [28]. 

Typically, the probabilistic analysis of Markovian models focuses on computing transient or steady-state 
distributions of a Markov chain. Thus, the previous tools can typically answer questions of the form “what 
is the probability of reaching a set of states S?” or “what is the steady-state probability of being in state S?”. 
Probabilistic model checkers are a related class of tools that provide a richer language for querying models, 
which allows one to consider not just reachability properties but also properties of execution paths in an 
underlying Markov model [17]. Properties are expressed in probabilistic temporal logics, most of which 
are extensions of the Probabilistic Computational Tree Logic (PCTL) or the Continuous Stochastic Logic 
(CSL) introduced in [12] and [2], respectively. As the name indicates, probabilistic model checking is an 
extension of traditional, nonprobabilistic model checking to probabilistic and, more generally, quantitative 
system properties. For example, a probabilistic model checker can be used to formally verify that a system 
model, usually given as a finite Markov chain or Markov decision process, satisfies properties such as “the 
failure probability is less than 10~ fc ” or “the probability that component A will fail before component B is 
more than p”. 

Several probabilistic model checkers are available, including PRISM [15], the Markov Reward Model 
Checker (MRMC) [16], the Erlangen-Twente Markov Chain Checker (ETMCC) [13], Ymer [34], and 
VeStA [29]. Among these, PRISM is recognized as the leading software tool. It has been continuously 
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developed and maintained since 2000, and it has been applied to a wide range of case studies. It can be 
downloaded for free at http : //www . prismmodel checker . org/. In this report, we investigate the 
use of PRISM to the analysis of high-reliability, fault-tolerant systems. 

We give an overview of the PRISM modeling language and logic. We then present several PRISM 
models and analysis of a representative high-reliability system — namely, the SPIDER fault-tolerant archi- 
tecture [21,33]. 

2 PRISM Overview 

PRISM is a tool for the formal modeling and verification of stochastic systems. The current release (i.e., 
PRISM 3.3. 1) supports three types of probabilistic models: discrete-time Markov chains (DTMCs), continuous- 
time Markov chains (CTMCs), and discrete-time Markov decision processes (MDPs). Support for proba- 
bilistic timed automata (PTAs) has been announced for the next release (i.e., PRISM 4.0) [18]. PTAs are 
an extension of MDPs with special real-valued variables called clocks that measure time progress. PTAs 
are then intended for modeling and analysis of real-time, probabilistic systems. A preliminary version of 
PRISM 4.0 is available as a beta release, but all the work and experiments described in this report were 
performed using PRISM 3.3.1. 

2.1 Modules 

A PRISM system is specified as the composition of one or more elementary systems called modules. Each 
module is defined by its state variables, which must all have a finite type, and by a set of guarded commands 
that describes state transitions and the associated probabilities. 

Figure 1 shows a PRISM specification that defines a simple fault model. The specification starts with 
the keyword ctmc to indicate that the model is a continuous-time Markov chain. Alternatively, one can 
use the keywords dtmc or mdp for a discrete-time Markov chain or Markov decision process, respectively. 
The model then declares three parameters named lambdal, lambda2, and lambda3 that define three 
transition rates. In this particular example, lambdal is intended to be the transient-fault rate; lambda2 is 
the permanent-fault rate; and lambda 3 is the recovery rate from transient faults. 

The module has a single state variable called state whose value is an integer in the set {0, 1,2}. The 
intended interpretation is that 0 is the fault-free state of the system, 1 is the state after a transient fault, and 
2 is the state after a permanent fault. Initially, the system is fault free: state is initialized to 0. 

The rest of the module consists of three guarded commands, which define the transition rates of the 
underlying Markov chain. The first command is 

[fault] (state = 0) -> lambdal: (state' =1) + lambda2: (state' =2); 

It defines the transitions originating from state 0: with rate lambdal the system moves to state 1 and with 
rate lambda2 the system enters state 2. This command is labeled by the identifier fault, which is used 
later in specifying rewards. The other two guarded commands in Figure 1 do not have a label. The transition 
rates such as lambda are optional. If they are not given in the guarded command, then PRISM will assume 
a default rate of 1 (as in the transition from state 2). The Markov chain corresponding to module cl of 
Figure 1 is depicted in Figure 2. 

The other parts of the PRISM model in Figure 1 include the definition of a set of states, identified by the 
label failed, that corresponds to the faulty states of the system (i.e., when state is different from 0). 
Also, three rewards are attached to the model. In general, a reward is a numerical quantity that is assigned to 
particular states or transitions of the system. In this example, the reward functional is equal to 1 when 
state=0 and 0 otherwise. The PRISM logic includes several constructs to express properties of these 
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ctmc 


const double lambdal; // transient faults 
const double lambda2; // permanent faults 
const double lambda3; // recovery from transient faults 

module cl 

state: [0..2] init 0; 

[fault] (state = 0) -> lambdal: (state' =1) + lambda2 : (state' =2); 

[] (state = 1) -> lambda3: (state'=0); 

[] (state=2) -> (state'=2); 

endmodule 

label "failed" = (state=l) | (state=2); 

// Rewards: 1 for state 0, 0 in all other states 
rewards "functional" 
state =0 : 1; 
state != 0: 0; 
endrewards 

// For computing expected times 
rewards "total_time" 
true: 1; 
endrewards 

// For computing number of faults 
rewards "num_f ailures " 

[fault] true: 1; 
endrewards 


Figure 1 . Example PRISM Model 
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Figure 2. Markov Chain Corresponding to the PRISM Model in Figure 1 


reward functions — for example, the expected cumulative reward in a bounded interval [0, t\. In the case of 
the functional reward, this is the average fraction of time during which the system is fault free. The 
reward function total-time can be used in a similar fashion — for example, to define the expected time to 
the first failure. As illustrated by num_f ailures, rewards can also be attached to state transitions. In this 
case, num_f ailures is equal to 1 every time the transition fault is taken, and to 0 on other transitions. 

2.2 Property Specification 

PRISM uses an extension of PCTL for expressing properties of Markov chains. The logic includes the 
traditional Computational Tree Logic (CTL) [7] operators that describe possible execution trees of a state- 
transition system and new operators that express probability constraints on these execution trees. The un- 
derlying semantics is then based on a probability space defined over the execution traces of a Markov 
chain [17,25,26]. 

Figure 3 shows example logical formulas related to the PRISM model defined previously and their 
informal interpretation. The first five examples illustrate a very common pattern of queries one makes about 
PRISM models. These queries are all of the form 

P =? [ path-property ] 

Such formulas ask for the probability that the Markov chain’s execution path satisfies the given path property. 
The path properties are themselves written using the CTL operators G (always), F (eventually), and U (until). 
For example, the path property F state=l means “eventually the system reaches state 1”. It is satisfied 
by all paths that pass through state 1, and the formula P=? [ F state=l] asks for the probability of 
this set of paths. 

FormulaP=? [G<=10.0 ! "failed"] uses a bounded form of the CTL operator G to make a query 
about failure probability for a bounded mission time. It uses the label failed defined in Figure 1 to 
refer to the faulty states. Property G<=10 ! "failed" is satisfied by all paths that spend at least 10 time 
units in the fault-free state — that is, all paths where the first fault occurs after time 10. Hence, formula 
P=? [G<=10 .0 ! " failed] asks for the probability that no fault occurs in the first 10 time units of the 
system behavior. 

Rather than asking PRISM to compute probabilities of interest, one can also assert bounds on the proba- 
bility of certain sets of paths and verify with PRISM whether these bounds actually hold. This is illustrated 
by the sixth property of Figure 3. The formula P>0 . 9 [ G<=10.0 (state! =2) ] is true in a state s 
of the Markov chain if the set of paths that start from s and do not reach state 2 in the first 10 time units has 
probability at least 0.9. Then property 

"init" => P>0 . 9 [ G<=10 . 0 (state! =2) ] 

holds for the system if all initial states satisfy the probability constraint. 
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// What's the probability of eventually experiencing a transient fault? 
P=? [ F state=l ] 


// What's the probability of a 
P=? [ F state=2 ] 

// What's the probability that 
P=? [ state=0 U state=l ] 


// What's the probability that 
P=? [ state=0 U state=2 ] 


// What's the probability that 
P=? [ G<=10 . 0 ! "failed" ] 


eventually experiencing a permanent fault 


the first fault is transient? 


the first fault is permanent? 


the system doesn't fail in a 10 hour run? 


// Is the probability of no permanent failure at least 0.9 
// (in a 10 hour run)? 

"init" => P>0 . 9 [ G<=10 . 0 (state!=2) ] 

// If we run for 10 hours, what's the expected amount of functional time 
R{ "functional " } =? [ C<=10.0 ] 

// What's the expected number of failures 
R{ "num_failures" } =? [ C<=10.0 ] 

// What's the expected time to the first failure 
R{ "total_time" } =? [ F "failed" ] 


Figure 3. Example PRISM Formulas for the Model in Figure 1 


Property 

Result 

P=? [ F state=l ] 

0.90909 

P=? [ F state=2 ] 

1.00000 

P=? [ state=0 U state=l ] 

0.90909 

P=? [ state=0 U state=2 ] 

0.09091 

P=? [ G<=10 . 0 ! "failed" ] 

0.99989 

"init" => P>0 . 9 [ G<=10 . 0 (state!=2) ] 

true 

R{ " functional" } =? [ C<=10.0 ] 

9.99994 

R{ "num_f ailures " } =? [ C<=10.0 ] 

1.098 x 10" 4 

R{ "total time"} =? [ F "failed" ] 

90909.1 




Parameters: lambdal = 10 5 , lambda2 = 10 6 , lambda3 = 10 


Table 1 . Model Checking Results for the Properties in Figure 3 

Finally, Figure 3 shows how PRISM can express quantitative properties using reward structures. For 
example, formula R{ " functional " } =? [ C<= 10.0] asks for the expected cumulated value of 
the reward function functional over the first 10 time units of a run. More examples of queries and 
formulas, and the exact semantics of the PRISM logical operators, are given in the PRISM manual [25]. 

2.3 The PRISM Tools 

To answer the previous queries or check formulas, PRISM can be used either as a command-line tool or 
via a graphical user interface. For example, one can check the properties listed in Figure 3 by using the 
following command. 

prism simple-f ault-model . sm simple-continuous . pctl 
-const ' lambdal=le-5 , lambda2=le-6, lambda3=10' 

This invokes the prism model checker on the model defined in file simple-fault-model . sm, for 
the set of properties stored in tile simple-continuous . pctl, and with specific values assigned to the 
model parameters lambdal, lambda2, and lambda3. 

For each property it evaluates, the tool prints statistics and results, as follows 

Model checking: P=? [ F state=l ] 

Model constants: lambdal=0 . 00001 , lambda2=0 . 000001, lambda3=10 
Starting iterations. . . 

Jacobi: 2 iterations in 0.00 seconds (average 0.000000, setup 0.00) 

Time for model checking: 0.0020 seconds. 

Result (probability): 0.9090909090909091 

This shows the probability of experiencing a transient failure for the Markov model of Figure 1. Table 1 
summarizes the prism results for the properties of Figure 3. 

PRISM also provides a graphical user interface (cf. Figure 4). This tool can be used for editing models, 
and specifying and verifying properties. More information on this tool and the prism command-line tool 
can be found in the PRISM manual [25]. 



Figure 4. PRISM Graphical User Interface 
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2.4 Other Features of PRISM 


In the example so far, we have considered only a PRISM model that consists of a single module. It is much 
more common to build PRISM systems by composing several modules. PRISM provides several module- 
combination operators, but the default parallel composition appeal's to be the most common: the modules 
perform transitions independently of each other, except that transitions having the same labels must be 
synchronized. This is a standard form of asynchronous product of labeled transition systems. The PRISM 
semantics document [26] explains how the transition probabilities are combined in such products. 

Another useful feature is the ability to define identical modules by renaming. This allows one to define 
a module B as a copy of another module A, while optionally changing the names of variables, parameters, 
or transition labels. 

In addition to building systems by combining modules, PRISM allows systems to include global state 
variables that can be read and written by all the modules. Some restrictions on the use of such variables 
ensure that only one module writes to a global variable at a time. 


3 Probabilistic Analysis of SPIDER Using PRISM 

We present applications of PRISM to the analysis of the SPIDER fault-tolerant distributed system. The 
standard reliability-estimation method for such a system starts with identifying a set of assumptions about 
the number and type of faulty components the system is designed to tolerate. Then, overall system reliability 
can be established in two steps: prove that the system actually remains functional under the given fault 
assumptions and estimate the probability that these assumptions hold. 

We first describe how this technique can be supported by the PRISM tool, and then we investigate the 
application of PRISM to other analysis of SPIDER, including protocol properties. 


3.1 SPIDER 

SPIDER [22,33] is a distributed computer architecture that provides a communication service for integrated 
modular avionics. The system relies on hardware replication, Byzantine-fault-tolerant protocols, and diag- 
nosis and reconfiguration to achieve very high reliability. The core of SPIDER is a Reliable Optical Bus 
(ROBUS) whose architecture is shown in Figure 5. ROBUS consists of N bus interface units (BIUs) and 
M redundancy management units (RMUs). Each BIU is connected to each RMU via a bidirectional com- 
munication link. Each BIU is also connected to a single processing element (PE). The goal of ROBUS is to 
provide a highly reliable communication medium to the PEs. 

More precisely, ROBUS implements a fault-tolerant clock synchronization algorithm to ensure that all 
the fault-free BIUs and RMUs are synchronized. This enables the PEs to access the bus by following 
the time-division multiple access (TDMA) strategy. All components in the system agree and enforce a 
system-wide communication schedule that prevents bus collisions. The redundant RMUs and BIUs provide 
a fault-tolerant consistent broadcast service by means of a Byzantine fault-tolerant voting algorithm. In 
addition, BIUs and RMUs implement an online diagnosis protocol to detect and exclude faulty components. 
Other important protocols used by ROBUS include a fault-tolerant startup and a reintegration protocol that 
enables nodes that are subject to transient failures to rejoin the system. The ROBUS protocols are described 
in detail in [33] and [11]. Formal modeling and verification of these protocols are presented in a series of 
publications, including [21,23,24]. 
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Figure 5. ROBUS Architecture 

3.2 Maximal Fault Assumption 

ROBUS provides correct service to the PEs under a specific set of assumptions about the number and severity 
of faulty components in the system. Following the hybrid fault model introduced in [32] (see also [23]), we 
assume that a BIU or RMU can be in one of the following four states: 

• Good: A node is good if it is free of faults. 

• Benign: A node is benign if it sends only messages that are manifestly wrong. This means that the 
recipients of such messages can accurately recognize that the message is incorrect. Since SPIDER 
components can detect that a message is missing, benign nodes includes fail-silent nodes. 

• Symmetric: A node is symmetric if it sends the same possibly incorrect message to all recipients, and 
the recipients may not detect that the message is incorrect. 

• Asymmetric: A node is asymmetric if it may send different, possibly incorrect, messages to each 
recipient. As previously, the recipients may not recognize that the message is incorrect. 

More refined fault classifications are possible (e.g., [1]) but we follow this taxonomy since it is used in the 
majority of the publications on SPIDER. 

Based on this taxonomy, we can partition the BIUs and the RMUs into four sets. Let GB, BB, SB, 
and AB denote the sets of good, benign, symmetric, and asymmetric BIUs, respectively. Similarly, let G It, 
BR, SR, and AR denote the sets of good, benign, symmetric, and asymmetric RMUs. Then, the SPIDER 
protocols provide correct services to the PEs provided the following constraints are satisfied [21]: 

1. \GB\ > \AB\ + |S5| 

2. \GR\ > \AR\ + |5i?| 

3. \AR\ = 0 or \AB\ = 0 
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Less formally, SPIDER can tolerate asymmetric components as long as they are all BIUs or all RMUs, 
and the system remains functional as long as there are enough good BIUs and RMUs to outvote the faulty 
components. 

The three conditions above are SPIDER’s maximal fault assumption. They describe the maximal number 
and types of faults that SPIDER has been proven to tolerate. 

3.3 Reliability Models in PRISM 

Our first reliability model of SPIDER follows the system architecture. It is given in the appendix (A. 1). The 
particular instance of SPIDER used in this study consists of three BIUs and three RMUs (i.e., N = M = 3). 
We then developed a more general model in which the numbers N and M are parameters. This second model 
is also given in the appendix (A.2). 

3.3.1 Baseline Model 

Our baseline model is a continuous-time Markov chain that closely follows the SPIDER architecture. This 
model includes three BIUs and three RMUs, each of them represented as a separate PRISM module. The 
PRISM specification of BIU1 is described in Figure 6. The modeling follows the fault classification pre- 
sented previously. It also distinguishes between permanent and transient faults. 

The state of a BIU is encoded as two variables: A variable specifies whether the BIU is good, benign, 
symmetric, or asymmetric, and a Boolean variable records whether the current fault is permanent or tran- 
sient. These two variables are named biul_state and biul_permanent in Figure 6. As indicated in 
the figure, the possible error states are represented by integers between 0 and 3. 

To model fault occurrences, we use two sets of parameters: 

• Transition rates: Two parameters define the fault-arrival rates for transient and permanent faults. 
They are called trans_lambda and perm.lambda, respectively. A third parameter is the rate of 
recovery from transient faults (called repair_lambda). All rates are assumed to be measured in 
number of faults per hour. 

• Probabilities: Three parameters define the probability that a fault is benign, symmetric, or asymmet- 
ric. The first two probabilities are given by parameters benign_ratio and symmetric-ratio 
in PRISM, with the third probability defined as 

const double asymmetric_ratio = (1 - benign_ratio - symmetric_ratio) ; 

Obviously, the actual values of benign_ratio and symmetric_ratio must be between 0 and 
1, and their sum must also be in the interval [0, 1]. PRISM cannot enforce these constraints, since it 
does not know the parameters’ interpretation. 1 

We have assumed that these probabilities are independent of whether the fault is transient or perma- 
nent, but it is easy to adjust the model to a different hypothesis. 

The PRISM module then specifies in a straightforward fashion the transitions from fault free to one of the 
possible error states. For example, a benign, transient fault occurs with a rate equal to trans_lambda 
* benign.ratio. If a fault is transient, then the BIU returns to the good state with a rate equal to 

repair_lambda. 

’PRISM does not currently support assertions on system parameters, so there is no easy way to encode these constraints in 
PRISM. 
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// Error states of each component 

const int good = 0; 

const int benign = 1; 

const int symmetric = 2; 

const int asymmetric = 3; 

// Parameters: fault arrival rates for each component 

const double perm_lambda; // permanent faults (e.g., 10~-6) 

const double trans_lambda; // transient faults (e.g., 10"-5) 

const double repair_lambda; // recovery from transient faults (e.g., 10) 

// Fractions of benign/symmetric/asymmetric faults 
const double benign_ratio; // 0.5 

const double symmetric_ratio; // 0.3 

const double asymmetric_ratio = (1 - benign_ratio - symmetric_ratio) ; 


// Fault arrival rates for the BIUs 
const double biu_perm_lambda = perm_lambda; 
const double biu_trans_lambda = trans_lambda; 
const double biu_repair_lambda = repair_lambda; 


// Fractions 

const double biu_benign = benign_ratio; 
const double biu_symmetric = symmetric_ratio; 
const double biu_asymmetric = asymmetric_ratio; 


// Error model for BIU1 
module biul 

biul_state: [0..3] init good; 

biul_permanent : bool init false; 


[] biul_state = good -> 

(biu_perm_lambda * biu_benign) : (biul_state' =benign) & 

(biul_permanent' =true) 

+ (biu_perm_lambda * biu_symmetric) : (biul_state' =symmetric) & 

(biul_permanent' =true) 

+ (biu_perm_lambda * biu_asymmetric) : (biul_state' =asymmetric) & 

(biul_permanent ' =true) 

+ (biu_trans_lambda * biu_benign) : (biul_state' =benign) & 

(biul_permanent ' =false) 

+ (biu_trans_lambda * biu_symmetric) : (biul_state' =symmetric) & 

(biul_permanent ' =false) 

+ (biu_trans_lambda * biu_asymmetric) : (biul_state' =asymmetric) & 

(biul_permanent ' =false) ; 

// recovery from transient faults 

[] ( ! biul_permanent & biul_state != good) -> 

biu_repair_lambda : (biul_state' =good) & (biul_permanent' =f alse) ; 


endmodule 


Figure 6. Baseline PRISM Model: A BIU 
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The other components of ROBUS — that is, the other BIUs and the RMUs — are modeled in the exact 
same way as BIU1. Their PRISM specification is identical to module biul of Figure 6, with state variables 
and parameters renamed appropriately. The details are shown in the appendix (A. 1). 

3.3.2 Properties 

The remainder of the PRISM model introduces various abbreviations and definitions to describe SPIDER’s 
maximal fault assumption. The model includes three BIUs, so part (1) of the fault assumption states that 
either at least two of them are good, or that one of them is good and the other two are benign. It is straight- 
forward to express these two conditions in PRISM: 

formula two_good_bius = 

(biul_state=good & biu2_state=good) 

I (biul_state=good & biu3_state=good) 

I (biu2_state=good & biu3_state=good) ; 

formula one_good_two_benign_bius = 

(biul_state=good & biu2_state=benign & biu3_state=benign) 

I (biul_state=benign & biu2_state=good & biu3_state=benign) 

I (biul_state=benign & biu2_state=benign & biu3_state=good) ; 

Clauses (2) and (3) of the fault assumptions are specified in a similar way, and the set of states that satisfy 
the maximal fault assumption is defined by 

label "functional" = 

(two_good_bius | one_good_two_benign_bius ) 

& (two_good_rmus | one_good_two_benign_rmus ) 

& (no_asymmetric_bius I no_asymmetric_rmus ) ; 

Then, the properties we examined include 

PI: probability that the system is always functional for a given mission time T. For example, for T = 10 
hours, this is written 

P=? [ G<=10 "functional" ] 

This PRISM query asks for the probability that the maximal fault assumption is satisfied at all times 
in the interval [0, T\. 

P2: The probability of system failure for a mission time of 1 hour is less than a bound po. For example, if 
Po is 10 -9 , this is written 

"all_good" => P<=le-9 [ F<=1 !" functional " ] 

In this formula, label "all_good" denotes the set of system states where all components are good. 

P3: The probability of system failure within one hour after a component fails symmetric is less than a 
bound pi. For example, withpi = 10 -6 , 

"one_fault"=>P<=le-6 [ F<=1 !" functional " ] 

This last property allows us to investigate the reliability of a 3 x 3 ROBUS instance after the symmetric 
failure of a single BIU or RMU. 2 

The complete list of properties that we evaluated with PRISM is given in the appendix (A.l). 

2 The PRISM model described here does not include the diagnosis and reconfiguration features of ROBUS, which are intended 
to help survive such a component failure. However, the PRISM model can be extended to take diagnosis into account, as done 
in [22] or [19]. 
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Property 

Parameter 

Result 

Run time 

PI 

T = 5 

0.99999999814 

0.41 s 

PI 

T= 10 

0.99999999359 

0.59 s 

PI 

T = 100 

0.99999945515 

4.15 s 

P2 

Po = 10- 8 

true 

0.28 s 

P2 

Po = IQ" 9 

true 

0.29 s 

P2 

O 

7 

o 

r— 1 

II 

O 

false 

0.29 s 

P3 

Pi = 10- 3 

true 

0.29 s 

P3 

Pi = 10- 4 

true 

0.29 s 

P3 

Pi = 10- 5 

false 

0.28 s 

P3 

PI = 10- 6 

false 

0.28 s 

P3 

p 1 = 10" 7 

false 

0.31 s 


Table 2. Example Analysis Results (3 BIUs, 3 RMUs) 


3.3.3 Analysis Results 

Table 2 shows the results and the runtime of the PRISM model checker on different instances of properties 
PI to P3. This experiment was run on a 64bit Intel Core 2 computer, with 4 GB RAM, and running Linux 2.6 
(Ubuntu 8.04). All the analysis was done using PRISM’s default settings. Memory consumption was not 
very large in this example, so we do not include memory usage data in the table. We used the following 
model parameters: 

• Rates: trans.lambda = 10~ 5 , perm_lambda = 10~ 5 , repair.lambda = 10 

• Probabilities: benign_ratio = 0.5, symmetric_ratio = 0.3 

As can be seen from the table, the model checker is fast in this example. The results are as one would expect. 
There is enough redundancy in this ROBUS instance to achieve very high reliability, of the order of 10 - 9 
system failures per hour. Reliability decreases significantly if one RMU or BIU suffers a (nondiagnosed) 
symmetric fault. 

It is easy to extend the PRISM model to larger versions of ROBUS by adding more copies of the BIU 
and RMU modules. Tables 3 and 4 show experimental results for a 4 x 3 instance (4 BIUs and 3 RMUs), 
and for a 4 x 4 instance. The rate and probability parameters were set as in the previous experiment. 
On these examples, one observes a significant increase in the model checker’s runtime. This is caused by 
the well-known state-explosion problem, as the number of states in the PRISM model (and the Markov 
chain) grows exponentially with the number of system modules. PRISM employs data structures based on 
decision diagrams [14] and sparse matrix representations, to attempt to store the Markov chain compactly, 
but scalability to systems with a large number of modules remains an issue. 

The model checker also uses iterative methods to compute transient probability distributions, and one 
observes that the number of iterations required increases with the number of modules. An increase in the 
number of iterations also explains why evaluating property PI for T = 100 takes much longer than for 
T = 10 or T = 5. 
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Property 

Parameter 

Result 

Run time 

PI 

T = 5 

0.99999999838 

8.99 s 

PI 

T= 10 

0.99999999443 

14.56 s 

PI 

T= 100 

0.99999952577 

98.90 s 

P2 

Po = 10- 8 

true 

5.318 s 

P2 

Po = 10- 9 

true 

5.314 s 

P2 

Po = 10- 10 

false 

5.298 s 

P3 

pi = io - 3 

true 

5.315 s 

P3 

Pi = io - 4 

true 

5.32 s 

P3 

PI = 10- 5 

false 

5.317 s 

P3 

Pi = io - 6 

false 

5.34 s 

P3 

Pi = 10 -7 

false 

5.318 s 


Table 3. Experiment: 4 BIUs and 3 RMUs 


Property 

Parameter 

Result 

Run time 

PI 

T = 5 

0.99999999861 

59.09 s 

PI 

T= 10 

0.99999999520 

103.03 s 

PI 

T = 100 

0.99999959193 

732.9148 s 

P2 

Po = IO " 8 

true 

33.057 s 

P2 

Po = IO " 9 

true 

33.052 s 

P2 

Po = IO " 10 

false 

33.037 s 

P3 

Pi = IO " 3 

true 

33.042 s 

P3 

Pi = IO " 4 

true 

33.074 s 

P3 

Pr = io - 5 

false 

33.045 s 

P3 

Pi = IO " 6 

false 

33.048 s 

P3 

PI = IO - 7 

false 

33.077 s 


Table 4. Experiment: 4 BIUs and 4 RMUs 
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module spider_bius 

good_bius: [0..N] init N; 

benign_transient_bius : [0..N] init 0; 

symmetric_transient_bius : [0..N] init 0; 

asymmetric_transient_bius : [0..N] init 0; 

benign_permanent_bius : [0..N] init 0; 

symmetric_permanent_bius : [0..N] init 0; 

asymmetric_permanent_bius : [0..N] init 0; 

[] good_bius>0 & 

benign_transient_bius<N & benign_permanent_bius<N & 
symmetric_transient_bius<N & symmetric_permanent_bius<N & 
asymmetric_transient_bius<N & asymmetric_permanent_bius<N 

-> 

(biu_trans_lambda * good_bius * biu_benign) : 

(good_bius' =good_bius-l) & 

(benign_transient_bius' = benign_transient_bius+l ) 

+ (biu_trans_lambda * good_bius * biu_symmetric) : 

(good_bius' =good_bius-l ) & 

( symmetric_transient_bius ' = symmetric_transient_bius+l ) 


+ (biu_perm_lambda * good_bius * biu_asymmetric) : 

(good_bius' =good_bius-l) & 

(asymmetric_permanent_bius ' =asymmetric_permanent_bius+l ) ; 

[] (benign_transient_bius>0 & good_bius<N) -> 

(biu_repair_lambda * benign_transient_bius ) : 

(good_bius ' =good_bius+l ) & 

(benign_transient_bius' =benign_transient_bius-l ) ; 


endmodule 


Figure 7. Revised Model for N BIUs 


3.3.4 Revised Model 

Rather than representing each component of the system as a separate PRISM module, a more traditional 
approach is to count the number of BIUs and RMUs in each fault class (cf. [5] for example). This modeling 
method takes advantage of the symmetries present in SPIDER to reduce the number of states in the Markov 
chain and increase scalability. We describe how such a model can be constructed and analyzed with PRISM. 

As previously, the PRISM system is parameterized by fault rates and by the probabilities of each type of 
fault. In addition, this new model is parameterized by N and M — that is, the number of BIUs and RMUs, 
respectively. The system consists of two similar modules; one describes the set of BIUs and the other models 
the RMU faults. The BIU model is sketched in Figure 7. 

This module has seven state variables with values in the interval [0, N], Variable good_bius is the 
number of good BIUs, benign_transient_bius counts the number of BIUs that have suffered a benign, 
transient fault, and so forth. Then, a fault can occur as long as the number of good BIUs is non-zero, and the 
effect of a fault is to decrement goocLbius and increment one of the other counters. The rate for a fault 
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type is the product of the fault arrival rate, the probability of the fault type, and the number of good BIUs. 
For example, the rate of asymmetric, permanent faults is given by 

biu_perm_lambda * good_bius * biu_asymmetric 

as shown in Figure 7. The PRISM model then includes one probabilistic transition to model the occurrence 
of faults. For technical reasons, the guard of this transition is the conjunction 

good_bius>0 & 

benign_transient_bius<N & benign_permanent_bius<N & 
symmetric_transient_bius<N & symmetric_permanent_bius<N & 
asymmetric_transient_bius<N & asymmetric_permanent_bius<N 

Intuitively, the constraint good_bius>0 should be sufficient: faults can occur as long as there are non- 
faulty BIUs. But PRISM requires additional constraints such as benign_transient Joius<N for type 
checking. Without this constraint, it is not possible to increment benign_transient_bius since this 
variable’s value must be in the interval [0,iV]. Adding such constraints does not change the underlying 
Markov chain, as all states of the chain satisfy the following invariant 

good_bius + benign_transient_bius + benign_permanent_bius + 
symmetric_transient_bius + symmetric_permanent_bius + 
asymmetric_transient_bius + asymmetric_permanent_bius = N. 

Recovery from transient faults is modeled in a similar fashion, as shown in Figure 7. 

The fault model for the set of RMUs is identical to the BIU model (cf. the appendix [A.2]), and the 
maximal fault assumption is specified as shown in Figure 8. 

We analyzed the same properties as previously using this revised model. The exact PRISM formulas 
used in this study are listed in the appendix (A.2). Since the model is now parameterized by N and M, we 
can ask PRISM to check these properties for different instances of ROBUS, by giving ranges of values for 
N and M (see [25] for details). 

Table 5 shows the runtimes and results of PRISM verification of property PI for different ROBUS 
instances, where N varies from 3 to 15 and M from 3 to 5, and for different values of the mission time T. 
In this experiment, the rates and fault probability parameters were set as in the previous section: 

• Rates: trans_lambda = 10~ 5 , perm_lambda = 1CU 5 , repair_lambda = 10 

• Probabilities: benign_ratio = 0.5, symmetric_ratio = 0.3 

As expected, this revised model is much more scalable than the previous model and can be used to analyze 
much larger instances of SPIDER. One can also observe that for the chosen fault rates, this model predicts 
that maximal reliability is achieved with about five RMUs and BIUs. With more components, the system 
reliability decreases slightly. This can be explained by the fact that SPIDER cannot tolerate Byzantine 
failures in both RMUs and BIUs, and with more system components, the probability that a least one RMU 
and one BIU are asymmetric increases. 3 

3.4 Protocol Models in PRISM 

A tool such as PRISM has been applied to the verification of randomized leader-election algorithms and 
other distributed protocols. Here, we describe an attempt to model one of SPIDER’s protocols and analyze 
it with PRISM. 

3 This phenomenon may be more pronounced in this model than in real SPIDER implementations, since component diagnosis 
and reconfiguration are not modeled. 
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formula symmetric_bius = 

symmetric_transient_bius+symmetric_permanent_bius ; 
formula asymmetric_bius = 

asymmetric_transient_bius+asymmetric_permanent_bius; 
formula bad_bius = symmetric_bius + asymmetric_bius; 
formula symmetric_rmus = 

symmetric_transient_rmus+symmetric_permanent_rmus; 
formula asymmetric_rmus = 

asymmetric_transient_rmus+asymmetric_permanent_rmus; 
formula bad_rmus = symmetric_rmus + asymmetric_rmus; 

// 

// Maximal fault assumption: 

// - number of good BIUs > number of bad BIUs 
// - number of good RMUs > number of bad RMUs 
// - no asymmetric BIU or no asymmetric RMU 
// 

label "functional" = 

(good_bius > bad_bius) & (good_rmus > bad_rmus) 

& (asymmetric_bius = 0 | asymmetric_rmus = 0); 

Figure 8. Maximal Fault Assumption in PRISM 
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T = 5 

T= 10 

T = 100 

M 

N 

Result 

Time 

Result 

Time 

Result 

Time 

3 

3 

0.999999994380 

0.023 s 

0.999999985032 

0.021 s 

0.999999360021 

0.047 s 

3 

4 

0.999999995108 

0.051 s 

0.999999986972 

0.052 s 

0.999999442968 

0.135 s 

3 

5 

0.999999996704 

0.107 s 

0.999999991222 

0.114s 

0.999999624638 

0.334 s 

3 

6 

0.999999996565 

0.205 s 

0.999999990853 

0.223 s 

0.999999608883 

0.731 s 

3 

7 

0.999999996427 

0.396 s 

0.999999990484 

0.444 s 

0.999999593130 

1.543 s 

3 

8 

0.999999996288 

0.886 s 

0.999999990114 

1.006 s 

0.999999577341 

3.561 s 

3 

9 

0.999999996149 

1.956 s 

0.999999989745 

2.281 s 

0.999999561553 

9.076 s 

3 

10 

0.999999996010 

3.075 s 

0.999999989375 

3.693 s 

0.999999545767 

15.215 s 

3 

11 

0.999999995872 

4.539 s 

0.999999989005 

5.521 s 

0.999999529981 

23.911 s 

3 

12 

0.999999995733 

6.858 s 

0.999999988636 

8.422 s 

0.999999514197 

37.450 s 

3 

13 

0.999999995594 

10.021 s 

0.999999988266 

12.343 s 

0.999999498414 

57.240 s 

3 

14 

0.999999995455 

14.194 s 

0.999999987897 

17.565 s 

0.999999482633 

83.052 s 

3 

15 

0.999999995316 

19.949 s 

0.999999987527 

24.896 s 

0.999999466852 

121.887 s 

4 

3 

0.999999995108 

0.067 s 

0.999999986973 

0.055 s 

0.999999442968 

0.137 s 

4 

4 

0.999999995791 

0.142 s 

0.999999988790 

0.129 s 

0.999999520647 

0.377 s 

4 

5 

0.999999997340 

0.290 s 

0.999999992916 

0.290 s 

0.999999697038 

0.942 s 

4 

6 

0.999999997155 

0.620 s 

0.999999992424 

0.677 s 

0.999999676016 

2.266 s 

4 

7 

0.999999996970 

1.418 s 

0.999999991931 

1.582 s 

0.999999654995 

5.688 s 

4 

8 

0.999999996785 

2.826 s 

0.999999991438 

3.259 s 

0.999999633939 

12.819 s 

4 

9 

0.999999996600 

4.510 s 

0.999999990946 

5.430 s 

0.999999612886 

21.825 s 

4 

10 

0.999999996415 

7.003 s 

0.999999990453 

8.329 s 

0.999999591833 

36.572 s 

4 

11 

0.999999996230 

10.813 s 

0.999999989960 

13.298 s 

0.999999570783 

58.089 s 

4 

12 

0.999999996045 

16.394 s 

0.999999989467 

20.056 s 

0.999999549734 

92.057 s 

4 

13 

0.999999995860 

24.324 s 

0.999999988975 

29.980 s 

0.999999528687 

140.312 s 

4 

14 

0.999999995675 

34.667 s 

0.999999988482 

43.105 s 

0.999999507641 

206.737 s 

4 

15 

0.999999995490 

48.679 s 

0.999999987989 

61.488 s 

0.999999486597 

302.655 s 

5 

3 

0.999999996704 

0.112 s 

0.999999991222 

0.120 s 

0.999999624638 

0.347 s 

5 

4 

0.999999997340 

0.267 s 

0.999999992916 

0.289 s 

0.999999697038 

0.930 s 

5 

5 

0.999999998843 

0.713 s 

0.999999996919 

0.805 s 

0.999999868138 

2.680 s 

5 

6 

0.999999998612 

1.947 s 

0.999999996304 

2.179 s 

0.999999841838 

8.154 s 

5 

7 

0.999999998381 

3.298 s 

0.999999995688 

3.860 s 

0.999999815539 

15.385 s 

5 

8 

0.999999998150 

5.613 s 

0.999999995072 

6.633 s 

0.999999789207 

27.815 s 

5 

9 

0.999999997918 

9.382 s 

0.999999994456 

11.140 s 

0.999999762876 

47.858 s 

5 

10 

0.999999997687 

14.847 s 

0.999999993840 

17.993 s 

0.999999736547 

79.659 s 

5 

11 

0.999999997456 

23.451 s 

0.999999993224 

28.736 s 

0.999999710220 

131.086 s 

5 

12 

0.999999997225 

35.189 s 

0.999999992608 

43.689 s 

0.999999683896 

206.533 s 

5 

13 

0.999999996993 

51.936 s 

0.999999991992 

65.307 s 

0.999999657573 

313.023 s 

5 

14 

0.999999996762 

74.964 s 

0.999999991376 

94.072 s 

0.999999631253 

463.504 s 

5 

15 

0.999999996531 

106.132 s 

0.999999990760 

134.989 s 

0.999999604934 

677.744 s 


Table 5. Example Analysis Results for the Revised SPIDER Model 
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3.4.1 SPIDER’s Broadcast Service 


SPIDER provides a broadcast service to the PEs by means of a two-round interactive consistency protocol 
described in [33]. If SPIDER’s maximal fault assumption is satisfied, then this protocol is guaranteed to 
achieve consistent broadcast: 

• If at the end of the protocol, a fault-free PE receives message m, then all fault-free PEs receive the 
exact same m. 

• If the BIU and originating PE are fault free, then m is identical to the message broadcast. 

We consider a simplified version of this protocol that does not take diagnosis and group membership 
into account. This simplified protocol works as follows. When a PE is scheduled to transmit data, it sends a 
message m to its BIU. This BIU checks that the message is well-formed (e.g., whether the CRC is valid). If 
the message m is valid, the BIU forwards it to all the RMUs; otherwise, the BIU sends the special message 
PE.ERROR. Every RMU checks the validity of the message it receives. If the message is valid, the RMU 
resends it to all BIUs. Otherwise, the RMU sends SOURCE_ERROR to all BIUs. At this point, each BIU 
receives M messages m \, . . . , mM and it performs a vote on this set of messages. If more than half of these 
messages are identical, then the majority copy is selected by the BIU and is relayed to the attached PE. 
Otherwise, the BIU forwards a special message NO_MAJORITY to the PE. 

Our goal is now to examine the probability that this protocol achieves consistent broadcast, under some 
assumptions about the rates and types of failure of BIUs and RMUs. 

3.4.2 PRISM Model 

We have developed a PRISM model of the interactive consistency protocol, for a ROBUS instance with three 
BIUs and three RMUs. The PRISM model is actually even more simplified: it describes a single execution 
of the protocol, where PE\ is the broadcast source. The full model is given in the appendix (B). 

Unlike all the PRISM examples described previously, this model is a discrete-time Markov decision 
process. It mixes probabilistic state transitions that model failure probabilities, and nondeterministic transi- 
tions that model the behavior of faulty components. A nonfaulty component executes the protocol properly, 
but may become either benign, symmetric, or asymmetric faulty with some probability. Once a component 
is symmetric or asymmetric, it can exhibit a range of incorrect behaviors. We model the set of possible 
behaviors as a nondeterministic choice between alternatives, to let the PRISM model checker discover the 
worst-case scenarios. 

The modeling follows a few general principles. Each communication link is modeled as a one-place 
buffer, represented as a global system variable in PRISM. Protocol components are modeled as state- 
transition systems (i.e., PRISM modules). The transmission of a message is modeled by writing the message 
into the global variable representing the appropriate channel. Conversely, the reception of a message is mod- 
eled by reading the channel variable. To properly coordinate the read and write operations, the transmitting 
and receiving modules synchronize using transition labels. 

The constant and global variables shown in Figure 9 model the protocol messages and the communica- 
tion channels. Messages are represented by integers in the interval [0, 7]. There are three types of messages: 
normal data, special control messages, and the special value bad_message that represents any manifestly 
incorrect message. This special value is intended to model missing messages, messages with a wrong CRC, 
or any other other type of error that can be recognized by the recipient. 

Because PRISM requires all types to be finite, we represent the possible valid data by the three constants 
datal, data2, and data3. At least three good values are necessary to model fault scenarios such as the 
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const int datal = 0; 

const int data2 = 1; 

const int data3 = 2; 

const int pe_error = 4; 

const int source_error = 5; 
const int no_majority = 6; 

const int bad_message = 7; 

// Global variables = communication links 


global 

pel_ 

biul: 

[0. . 

■ 7 ] 

init 

0; 

global 

pe2_ 

biu2 : 

[0. . 

■ 7 ] 

init 

0; 

global 

pe3_ 

biu3 : 

[0. . 

■ 7 ] 

init 

0; 

global 

biul. 

_rmu 1 : 

[0. 

. .7] 

init 

0; 

global 

biul. 

_rmu2 : 

[0. 

. .7] 

init 

0; 

global 

biul. 

_rmu3 : 

[0. 

. .7] 

init 

0; 

global 

biu2 

_rmul : 

[0. 

. .7] 

init 

0; 

global 

biu2 

_rmu2 : 

[0. 

. .7] 

init 

0; 

global 

biu2 

_rmu3 : 

[0. 

. .7] 

init 

0; 

global 

biu3. 

_rmul : 

[0. 

. .7] 

init 

0; 

global 

biu3. 

_rmu2 : 

[0. 

. .7] 

init 

0; 

global 

biu3. 

_rmu3 : 

[0. 

. .7] 

init 

0; 


// Source data that PEI will broadcast 
const int source = data3; 


Figure 9. Messages and Communication Channels 


transmission of distinct messages by an asymmetric BIU to the three RMUs, without any of these messages 
being recognized as incorrect by the recipient. 4 

The global variables such as pel Joiul represent bidirectional channels between two components (e.g., 
between PEI and BIU1). 

The protocol processes are modeled as illustrated in Figure 10. The figure shows a fragment of the PEI 
model, which consists of two modules: peLerror is the fault model for PEI, and pel models protocol 
execution. 

Module peLerror specifies that PEI may suffer a single fault at an arbitrary time during protocol 
execution. The time of occurrence is nondeterministic, but the fault type is randomly selected according to 
the probability distribution defined by the parameters pe_benign and pe_symmetric. 

Module pel has a main control variable pel_ctrl that specifies successive steps of the protocol. For 
example, from state 0, the process initiates a broadcast by transmitting the source message to BIU1 and 
then transitions to the next step, where pel_ctrl=l. The actual message transmitted depends on PEl’s 
current fault status. If PEI is fault free, then the source data is sent; if PEI is benign, then the badunes sage 
is sent (BIU1 will recognize this message as incorrect); otherwise, PEI has suffered a symmetric fault and it 
may send arbitrary data to BIU1. We model the latter case, as a nondeterministic choice between three state 
transitions, as shown in Figure 10. 

In all cases, the broadcast action sets pel_ctrl to 1, and the only possible transition of process pel 

4 One could argue that more good data values would allow us to model more general faulty behaviors, such as, an asymmetric 
RMU sending three distinct messages that are also different from the message it receives. Unfortunately, we were not able to add 
an extra data4 value, as this causes PRISM to run out of memory. 
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module pel_error 

pel_faulted: bool init false; 
pel_error: [0..2] init fault_free; 


[] ( ! pel_f aulted) -> 

pe_benign : 

+ pe_symmetric : 
+ pe_nofault: 


(pel_error ' =benign_f aulty) & (pel_f aulted' =true) 
(pel_error' =symmetric_f aulty ) & (pel_f aulted' =true) 
(pel_error' =f ault_f ree ) & (pel_f aulted' =true) ; 


endmodule 


module pel 

pel_ctrl: [0..4] init 0; 

pel_out: [0..7] init 0; 


// broadcast 

[] (pel_ctrl=0) & (pel_error=fault_f ree) -> (pel_ctrl' =1) & 

(pel_biul' = source) ; 


[] 

[] 

[] 


(pel_ctrl=0) & 
(pel_ctrl=0) & 
(pel_ctrl=0) & 


(pel_error=symmetric_f aulty) 
(pel_error=symmetric_f aulty) 
(pel_error=symmetric_f aulty ) 


-> (pel_ctrl ' =1 ) & 

(pel_biul ' =datal ) ; 
-> (pel_ctrl ' =1 ) & 

(pel_biul ' =data2 ) ; 
-> (pel_ctrl ' =1 ) & 

(pel_biul ' =data3 ) ; 


[pel_broadcast ] (pel_ctrl=l) -> (pel_ctrl' =2 ) ; 


endmodule 


Figure 10. Broadcast Protocol: Process PEI 
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// BIU1 
module biul 

biul_ctrl: [0..6] init 0; 

// wait for broadcast from pel 

[pel_broadcast ] (biul_ctrl=0 ) -> (biul_ctrl' =1 ) ; 

// Fault-free case: consume the message and forward it to all rmus 
[] (biul_ctrl=l ) & (biul_error=f ault_f ree) & (pel_biul ! =bad_message) -> 
(biul_ctrl' =2) & (biul_rmul ' =pel_biul ) & 

(biul_rmu2 ' =pel_biul ) & (biul_rmu3 ' =pel_biul ) ; 

[] (biul_ctrl=l ) & (biul_error=f ault_f ree) & (pel_biul=bad_message) -> 
(biul_ctrl' =2) & (biul_rmul ' =pe_error ) & 

(biul_rmu2' =pe_error) & (biul_rmu3 ' =pe_error ) ; 


Figure 11. Broadcast Protocol: Process BIU1 

after the broadcast increments pel_ctrl. This transition has the label pel_broadcast, and its role 
is to synchronize pel and biul, so that the latter can consume the message just transmitted. Figure 11 
shows the corresponding specification fragment for module biul. Initially, biul is in state 0. It stays in 
that state until it can execute the transition labeled with pelibroadcast, which increments biul_ctrl 
so that the next transitions from biul are enabled. Since all transitions that share the same label must 
execute simultaneously, both pel and biul perform this transition: that is, just after pel has transmitted 
a message, and just before biul reads that message. A similar mechanism, based on synchronized labeled 
transitions is used throughout the PRISM specification to model all communication. 

3.4.3 Protocol Properties 

The full protocol model is specified in the appendix (B). For scalability reasons, we have included fault 
models for only two components — pel and biul. All the other PEs and BIUs, and all the RMUs are 
assumed to be fault free. Several sets of states are identified and used in property specifications: 

label "done" = pel_ctrl=4 & pe2_ctrl=2 & pe3_ctrl=2; 

label "all_good" = (pel_out=source) & (pe2_out=source) 

& (pe3_out=source) ; 

label "unanimous" = (pel_out=pe2_out ) & (pel_out=pe3_out ) ; 
label "nonf aulty_agree " = (pe2_out=pe3_out) ; 

label "nonf aulty_valid" = (pe2_out=source) & (pe3_out=source) ; 

The set of states called done indicates protocol termination. The other sets of states capture two variants 
of the consistency and validity properties that the protocol is intended to achieve. For example, all_good 
means that all PEs receive the source message, whereas unanimous is the weaker property that all PEs 
receive the same thing. 

We examined the protocol properties listed in the appendix (B). First, we expect that the protocol always 
terminates, so the following PRISM property should be true: 

"init"=>P>=l [ F "done" ] 
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Then, we investigate the probabilities of agreement and validity. Since the PRISM model is a Markov 
decision process, the actual probabilities depend on how the nondeterministic choices are resolved. For 
example, we can consider two variants of the same property: 

Pmin =? [ GC'done" => "all_good") ] 

Pmax =? [ GC'done" => "all_good") ] 

In the first line, we ask for the minimal probability that the protocol ends in a state where all PEs receive the 
broadcast as transmitted. This is the probability of success under the worst possible fault scenario allowed 
in our model. Conversely, the second line asks for the maximal probability of success (which should be 1). 

3.4.4 Analysis Results 

Table 6 shows the results and runtime we observed when running the PRISM model checker on the above 
example. It also shows model-construction time and the amount of memory consumed by the PRISM 
process. This data was measured on a 64bit Linux 2.6 system, with an Intel Core 2 CPU, and 4 GB of RAM. 
The results show that it is possible to analyze these example properties in about 8 minutes. However, the 
performance numbers are disappointing given the size of the example model. Most of the verification time 
is spent by PRISM in building an internal representation of the MDP. The internal representation is based on 
multi-terminal binary decision diagrams (MTBDDs), which appear to be a very poor representation for this 
model. In particular, the amount of memory used to store the MTBDDs and other internal data structures is 
enormous; it requires more than 3 GB. This seems to be a serious limitation of the MTBDD representation, 
as the statistics displayed by PRISM indicate that the actual size of the underlying MDP is not that large: it 
contains 15446 states and 50357 transitions. 5 

It is well-known that the variable ordering used by an MTBDD can have a major impact on the size 
of this MTBDD. But there is no obvious way to control which ordering PRISM will use. All our efforts 
at reorganizing the specifications by changing module and declaration order did not have an impact on 
the analysis time or memory usage. There has been some work on heuristics for obtaining good variable 
ordering for MTBDDs in PRISM [20], but these heuristics were not implemented in the PRISM version that 
we used. 

4 Conclusion 

We have described several applications of the PRISM probabilistic model checker to fault-tolerant dis- 
tributed systems. 

We first investigated the use of PRISM in reliability modeling, as an alternative to other tools such 
as SURE or Mobius. In this study, we considered two different approaches to modeling the reliability of 
the SPIDER system. Both approaches relied on continuous-time Markov chains. The main result of this 
experiment is that PRISM is perfectly adequate as a reliability analysis tool. The PRISM specification 
language is expressive enough and enables easy modeling. The model checker performance is also good, 
and PRISM can be used to analyze relatively large and detailed models. There is still a limitation to this 
approach, since PRISM cannot encode or analyze non-Markovian models, unlike other tools that support 
semi-Markovian models or more general classes of stochastic systems. 

In a second experiment, we examined application of PRISM to the analysis of a more detailed SPIDER 
protocol, encoded as a state machine. The PRISM model developed for this purpose was a Markov decision 

5 We have observed very high memory usage (gigabytes) on much smaller MDPs with fewer than a hundred states and a few 
hundred transitions. 
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Property 

Result 

Runtime 

"init"=>P>=l [ F "done" ] 

true 

1.801 s 

P>=1 [ G ( " done"=>"all_good" ) ] 

false 

3.041 s 

P>=1 [ G ( "done " => " unanimous " ) 1 

false 

1.950 s 

Pmin =? [ GC'done" => "unanimous") ] 

0.7 

1.620 s 

Pmax =? [GC'done" => "unanimous")] 

1.0 

0.727 s 

Pmin =? [G ("done" => "nonfaulty_agree" ) ] 

1.0 

0.021 s 

Pmax =? [G ("done" => "nonfaulty_agree" ) ] 

1.0 

0.060 s 

Pmin =? [G ("done" => "nonf aulty_valid" ) ] 

0.56 

2.195 s 

Pmax =? [G ("done" => "nonf aulty_valid" ) ] 

1.0 

0.473 s 


Model Construction: 450.023 s 

Virtual Memory Used: 3676 Mb 

Resident Memory Used: 2358 Mb 


Table 6. Broadcast Protocol: Analysis Results 


process, in which fault occurrence and the behavior of faulty components are modeled via nondeterministic 
choices, while the type of fault is determined by a probability distribution given a priori. This experiment 
showed that it is possible to model and analyze a fault-tolerant protocol using PRISM, but it also showed 
some limits of the PRISM tool. Although the protocol modeled was quite small, PRISM required a very 
large amount of memory to build and store its internal data structures for this Markov decision process. 
Scalability to more complex protocols or systems is then questionable. 
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Appendix A 


Reliability Models of SPIDER 


A.l Baseline Model 

// 

// Fault/reliability model for a SPIDER instance 
// modeled as a continuous-time Markov chain 
// 

// This model includes 3 BIUs and 3 RMUs 

// 

// All components are assumed to have similar fault models parameterized by: 

// perm_lambda: rate of permanent faults 

// trans_lambda : rate of transient faults 

// repair_lambda : recovery rate from transient faults 

// 

// Three classes of faults are considered: benign, symmetric, and asymmetric 

// 

// Two parameters define the probability that a fault is benign or symmetric 
// benign_ratio : probability that a fault is benign 
// symmetric_ratio : probability that a fault is symmetric 
// asymmetric_ratio : probability that a fault is asymmetric 
// = (1 - benign_ratio - symmetric_ratio) 

// 

ctmc 


// Error states of each component 

const int good = 0; 

const int benign = 1; 

const int symmetric = 2; 

const int asymmetric = 3; 


// Parameters: arrival rates for each component 

const double perm_lambda; // permanent faults (e.g., 10~-6) 

const double trans_lambda; // transient faults (e.g., 10~-5) 

const double repair_lambda; // recovery from transient faults (e.g., 10) 

// Fractions of benign/symmetric/asymmetric faults 
// all must be between 0 and 1 and they must add to 1 
const double benign_ratio; // 0.5 

const double symmetric_ratio; // 0.3 

const double asymmetric_ratio = (1 - benign_ratio - symmetric_ratio) ; 

// Fault arrival rates for the RMUs ( faults/hour ) 
const double rmu_perm_lambda = perm_lambda; 
const double rmu_trans_lambda = trans_lambda; 
const double rmu_repair_lambda = repair_lambda; 
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// Fraction of fault types: 
const double rmu_benign = benign_ratio; 
const double rmu_symmetric = symmetric_ratio; 
const double rmu_a symmetric = asymmetric_ratio; 


// Fault arrival rates for the BIUs 
const double biu_perm_lambda = perm_lambda; 
const double biu_trans_lambda = trans_lambda; 
const double biu_repair_lambda = repair_lambda; 


// Fractions 

const double biu_benign = benign_ratio; 
const double biu_symmetric = symmetric_ratio; 
const double biu_asymmetric = asymmetric_ratio; 


// Error model for BIU1 
module biul 

biul_state: [0..3] init good; 

biul_permanent : bool init false; 

[] biul_state = good -> 

(biu_perm_lambda * biu_benign) : (biul_state' =benign) & 

(biul_permanent' =true) 

+ (biu_perm_lambda * biu_symmetric) : (biul_state' =symmetric) & 

(biul_permanent ' =true) 

+ (biu_perm_lambda * biu_asymmetric) : (biul_state' =asymmetric) & 

(biul_permanent ' =true) 

+ (biu_trans_lambda * biu_benign) : (biul_state' =benign) & 

(biul_permanent ' =false) 

+ (biu_trans_lambda * biu_symmetric) : (biul_state' =symmetric) & 

(biul_permanent ' =false) 

+ (biu_trans_lambda * biu_asymmetric) : (biul_state' =asymmetric) & 

(biul_permanent ' =f alse ) ; 

// recovery from transient faults 

[] ( ! biul_permanent & biul_state != good) -> 

biu_repair_lambda : (biul_state' =good) & (biul_permanent' =f alse) ; 


endmodule 


// Same model 
module biu2 = 

endmodule 


for BIU2 and BIU3, modulo renaming 
biul [biul_state=biu2_state, 

biul_permanent=biu2_permanent ] 


module biu3 
endmodule 


biul [biul_state=biu3_state, 

biul_permanent=biu3_permanent ] 
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// Error model for RMU1 : same as for BIU1 modulo renaming 
module rmul = biul [biul_state=rmul_state, 

biul_permanent=rmul_permanent , 

b i u_p e r m_l ambda = r mu_p e rm_l amb da , 

biu_trans_lambda=rmu_trans_lambda, 

biu_benign=rmu_benign, 

biu_symmetric=rmu_symmetric, 

biu_asymmetric=rmu_asymmetric] 

endmodule 


// Same model 
module rmu2 = 

endmodule 


for RMU2 and RMU3 too 
rmul [rmul_state=rmu2_state, 

rmul_permanent=rmu2_permanent ] 


module rmu3 
endmodule 


rmul [rmul_state=rmu3_state, 

rmul_permanent=rmu3_permanent ] 


// 

// Maximum fault assumption for SPIDER 

/ / - at least two good BIUs or one good BIU and two benign BIUs 

/ / - at least two good RMUs or one good RMU and two benign RMUs 

/ / - at most one asymmetric BIU or RMU 

// 

formula two_good_bius = 

(biul_state=good & biu2_state=good) | (biul_state=good & biu3_state=good) 

I (biu2_state=good & biu3_state=good) ; 

formula one_good_two_benign_bius = 

(biul_state=good & biu2_state=benign & biu3_state=benign) 

[ (biul_state=benign & biu2_state=good & biu3_state=benign) 

I (biul_state=benign & biu2_state=benign & biu3_state=good) ; 

formula no_asymmetric_bius = 

biul_state ! =asymmetric & biu2_state ! =asymmetric & biu3_state ! =asymmetric; 


formula two_good_rmus = 

(rmul_state=good & rmu2_state=good) | (rmul_state=good & rmu3_state=good) 

| (rmu2_state=good & rmu3_state=good) ; 

formula one_good_two_benign_rmus = 

(rmul_state=good & rmu2_state=benign & rmu3_state=benign) 

| (rmul_state=benign & rmu2_state=good & rmu3_state=benign) 

| (rmul_state=benign & rmu2_state=benign & rmu3_state=good) ; 

formula no_asymmetric_rmus = 

rmul_state ! =asymmetric & rmu2_state ! =asymmetric & rmu3_state ! =asymmetric; 


label "functional" = 

(two_good_bius | one_good_two_benign_bius ) 
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& (two_good_rmus | one_good_two_benign_rmus ) 
& (no_asymmetric_bius | no_asymmetric_rmus ) ; 


// 

// Other system states used in properties 

// 

formula all_bius_good = 

(biul_state=good) & (biu2_state=good) & (biu3_state=good) ; 
formula all_rmus_good = 

(rmul_state=good) & (rmu2_state=good) & (rmu3_state=good) ; 


// No faulty component 

label "all_good" = all_bius_good & all_rmus_good; 

// 

// To test reliability after a symmetric fault 

// 

formula one_symmetric_biu = 

(biul_state=symmetric & biu2_state=good & biu3_state=good) 

I (biul_state=good & biu2_state=symmetric & biu3_state=good) 

I (biul_state=good & biu2_state=good & biu3_state=symmetric) ; 

formula one_symmetric_rmu = 

(rmul_state=symmetric & rmu2_state=good & rmu3_state=good) 

| (rmul_state=good & rmu2_state=symmetric & rmu3_state=good) 

| (rmul_state=good & rmu2_state=good & rmu3_state=symmetric) ; 


label "one_fault" = 

(one_symmetric_biu & all_rmus_good) | (one_symmetric_rmu & all_bius_good) ; 


rewards "total_time" 
true: 1; 
endrewards 


Properties for the previous model 

// Probability that no component fails over a lOhour run 
P=? [ G<=10 "all_good" ] 

// Expected time to the first component failure 
R{ "total_time" }=? [ F ! "all_good" ] 


// Probability that the system does not fail over a 5hour run 
P=? [ G<=5 "functional" ] 
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// Probability that the system does not fail over a lOhours run 
P=? [ G<=10 "functional" ] 

// Probability that the system does not fail over a lOOhour run 
P=? [ G<=100 "functional" ] 


// The probability of system failure over a Ihour run is less than 10~-8 
"all_good"=>P<=le-8 [ F<=1 !" functional " ] 

// The probability of system failure over a Ihour run is less than 1CP-9 
"all_good"=>P<=le-9 [ F<=1 !" functional " ] 

// The probability of system failure over a Ihour run is less than 10~-10 
"all_good"=>P<=le-10 [ F<=1 !" functional " ] 


// Probability of system failure after a one hour run is less than le-3, 

// even if we start with a single benign component failure. 

"one_f ault "=>P<=le-3 [ F<=1 !" functional " ] 

// Probability of system failure after a one hour run is less than le-4, 

// even if we start with a single benign component failure. 

"one_f ault "=>P<=le-4 [ F<=1 !" functional " ] 

// Probability of system failure after a one hour run is less than le-5, 

// even if we start with a single benign component failure. 

"one_f ault "=>P<=le-5 [ F<=1 !" functional " ] 

// Probability of system failure after a one hour run is less than le-6, 

// even if we start with a single benign component failure. 

"one_f ault "=>P<=le-6 [ F<=1 !" functional " ] 

// Probability of system failure after a one hour run is less than le-7, 
// even if we start with a symmetric component. 

"one_f ault "=>P<=le-7 [ F<=1 !" functional " ] 

// Probability of system failure after a one hour run is less than le-8, 

// even if we start with a single benign component failure. 

"one_f ault "=>P<=le-8 [ F<=1 !" functional " ] 
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A.2 Revised Model 


// 

// Fault/reliability model for SPIDER 

// modeled as a continuous-time Markov chain 

// 

// This model uses counters for the number of good and faulty RMUs and BIUs 
// Two parameters are used: N = number of BIUs, M = number of RMUs 
// 

// All components are assumed to have similar fault models parameterized by: 

// perm_lambda: rate of permanent faults 

// trans_lambda : rate of transient faults 

// repair_lambda : recovery rate from transient faults 

// 

// Three classes of faults are considered: benign, symmetric, and asymmetric 

// 

// Two parameters define the probability that a fault is benign or symmetric 
// benign_ratio : probability that a fault is benign 
// symmetric_ratio : probability that a fault is symmetric 
// asymmetric_ratio : probability that a fault is asymmetric = 

// (1 - benign_ratio - symmetric_ratio) 

// 

ctmc 

// Parameters: arrival rates for each component 

const double perm_lambda; // permanent faults (e.g., 10~-6) 

const double trans_lambda; // transient faults (e.g., 10~-5) 

const double repair_lambda; // recovery from transient faults (e.g., 10) 

// Fractions of benign/symmetric/asymmetric faults 

// all must be between 0 and 1 and they must add to 1 

const double benign_ratio; // e.g., 0.5 

const double symmetric_ratio; // e.g., 0.3 

const double asymmetric_ratio = (1 - benign_ratio - symmetric_ratio) ; 

// Fault arrival rates for the RMUs (faults/hour) 
const double rmu_perm_lambda = perm_lambda; 
const double rmu_trans_lambda = trans_lambda; 
const double rmu_repair_lambda = repair_lambda; 

// Fraction of fault types: 
const double rmu_benign = benign_ratio; 
const double rmu_symmetric = symmetric_ratio; 
const double rmu_a symmetric = asymmetric_ratio; 


// Fault arrival rates for the BIUs 
const double biu_perm_lambda = perm_lambda; 
const double biu_trans_lambda = trans_lambda; 
const double biu_repair_lambda = repair_lambda; 

// Fractions 

const double biu_benign = benign_ratio; 
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const double biu_symmetric = symmetric_ratio; 
const double biu_asymmetric = asymmetric_ratio; 


// Total number of BIUs and RMUs 
const int N; // number of BIUs 
const int M; // number of RMUs 

module spider_bius 

good_bius: [0..N] init N; 

benign_transient_bius : [0..N] init 0; 

symmetric_transient_bius : [0..N] init 0; 

asymmetric_transient_bius : [0..N] init 0; 

benign_permanent_bius : [0..N] init 0; 

symmetric_permanent_bius : [0..N] init 0; 

asymmetric_permanent_bius : [0..N] init 0; 

// Fault of one good BIU 

// NOTE: prism complains if we give only ' good_bius>0 ' as precondition 
[] good_bius>0 & 

benign_transient_bius<N & benign_permanent_bius<N & 
symmetric_transient_bius<N & symmetric_permanent_bius<N & 
asymmetric_transient_bius<N & asymmetric_permanent_bius<N 

-> 

(biu_trans_lambda * good_bius * biu_benign) : 

(good_bius' =good_bius-l) & 

(benign_transient_bius' = benign_transient_bius+l ) 

+ (biu_trans_lambda * good_bius * biu_symmetric) : 

(good_bius' =good_bius-l) & 

( symmetric_transient_bius ' = symmetric_transient_bius+l ) 

+ (biu_trans_lambda * good_bius * biu_asymmetric) : 

(good_bius' =good_bius-l) & 

(asymmetric_transient_bius ' =asymmetric_transient_bius+l ) 

+ (biu_perm_lambda * good_bius * biu_benign) : 

(good_bius' =good_bius-l) & 

(benign_permanent_bius' = benign_permanent_bius+l ) 

+ (biu_perm_lambda * good_bius * biu_symmetric) : 

(good_bius' =good_bius-l) & 

( symmetric_permanent_bius ' = symmetric_permanent_bius+l ) 

+ (biu_perm_lambda * good_bius * biu_asymmetric) : 

(good_bius' =good_bius-l) & 

(asymmetric_permanent_bius ' =asymmetric_permanent_bius+l ) ; 

// recovery from transient faults 
[] (benign_transient_bius>0 & good_bius<N) -> 

(biu_repair_lambda * benign_transient_bius ) : 

(good_bius' =good_bius+l ) & 

(benign_transient_bius ' =benign_transient_bius-l ) ; 

[] (symmetric_transient_bius>0 & good_bius<N) -> 

(biu_repair_lambda * symmetric_transient_bius ) : 

(good_bius' =good_bius+l ) & 

( symmetric_transient_bius' =symmetric_transient_bius-l ) ; 


35 


[] (asymmetric_transient_bius>0 & good_bius<N) -> 

(biu_repair_lambda * asymmetric_transient_bius ) : 

(good_bius ' =good_bius+l ) & 

(asymmetric_transient_bius ' =asymmetric_transient_bius-l ) ; 


endmodule 


// RMU error model: same thing modulo renaming 
module spider_rmus = 
spider_bius [N=M, 

good_bius=good_rmus, 

benign_transient_bius=benign_transient_rmus , 

symmetric_transient_bius=symmetric_transient_rmus, 

asymmetric_transient_bius=asymmetric_transient_rmus, 

benign_permanent_bius=benign_permanent_rmus , 

symmetric_permanent_bius=symmetric_permanent_rmus, 

asymmetric_permanent_bius=asymmetric_permanent_rmus, 

biu_trans_lambda=rmu_trans_lambda, 

b i u_pe r m_l amb da= r mu_p e rm_l amb da , 

biu_repair_lambda=rmu_repair_lambda, 

biu_benign=rmu_benign, 

b i u_s ymme t r i c= rmu_s y mme trie, 

biu_asymmetric=rmu_asymmetric] 

endmodule 


// Abbreviations 
formula symmetric_bius = 

symmetric_transient_bius + symmetric_permanent_bius ; 
formula asymmetric_bius = 

asymmetric_transient_bius+asymmetric_permanent_bius ; 
formula bad_bius = symmetric_bius + asymmetric_bius ; 


formula symmetric_rmus = 

symmetric_transient_rmus+symmetric_permanent_rmus ; 
formula asymmetric_rmus = 

asymmetric_transient_rmus+asymmetric_permanent_rmus ; 
formula bad_rmus = symmetric_rmus + asymmetric_rmus ; 


// 

// Maximal fault assumption: 

II- number of good BIUs > number of bad BIUs 
II- number of good RMUs > number of bad RMUs 
II- no asymmetric BIUs or no asymmetric RMUs 
// 

label "functional" = 
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(good_bius > bad_bius) & (good_rmus > bad_rmus) 
& (asymmetric_bius=0 | asymmetric_rmus=0 ) ; 


// No component faulty 

label "all_good" = (good_bius=N) & (good_rmus=M) ; 

// At most one symmetric faulty component 
label "one_fault" = 

(good_bius>=N-l) & (good_rmus>=M-l ) 

& (asymmetric_bius=0 ) & (asymmetric_rmus=0 ) 

& ( ( (symmetric_bius=l ) & (symmetric_rmus=0) ) 

I ( ( symmetric_bius=0 ) & ( symmetric_rmus=l ) ) ) ; 

// For expected times 
rewards "total_time" 
true: 1; 
endrewards 


Properties for the previous model 

// Probability that no component fails over a lOhour run 
P=? [ G<=10 "all_good" ] 

// Expected time to the first component failure 
R{ "total_time" }=? [ F ! "all_good" ] 


// Probability that the system does not fail over a 5hour run 
P=? [ G<=5 "functional" ] 

// Probability that the system does not fail over a lOhours run 
P=? [ G<=10 "functional" ] 

// Probability that the system does not fail over a lOOhour run 
P=? [ G<=100 "functional" ] 


// The probability of system failure over a Ihour run is less than 10~-9 
"all_good"=>P<=le-9 [ F<=1 !" functional " ] 


// Probability of system failure after a one hour run is less than le-9, 
// even if we start from a start with a single benign component failure. 
"one_f ault "=>P<=le-9 [ F<=1 !" functional " ] 


// Probability of system failure after a one hour run is less than le-8, 
// even if we start from a start with a single benign component failure. 
"one_f ault "=>P<=le-8 [ F<=1 !" functional " ] 


// Probability of system failure after a one hour run is less than le-8, 
// even if we start from a start with a single benign component failure. 
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"one_f ault "=>P<=le-7 [ F<=1 !" functional " ] 

// Probability of system failure after a one hour run is less than le-8 
// even if we start from a start with a single benign component failure 
"one_fault"=>P<=le-3 [ F<=1 !" functional " ] 

// Probability of system failure after a one hour run is less than le-8 
// even if we start from a start with a single benign component failure 
"one_fault"=>P<=le-4 [ F<=1 !" functional " ] 

// Probability of system failure after a one hour run is less than le-8 
// even if we start from a start with a single benign component failure 
"one_f ault "=>P<=le-5 [ F<=1 !" functional " ] 
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Appendix B 


Model of SPIDER’s Broadcast Service 


mdp 

// Messages: encoded in the range [0 .. 7] 

// datal .. data3 are data messages 

// pe_error, source_error , no_majority: special messages 

// bad_message: manifest fault 

const int datal = 0; 

const int data2 = 1; 

const int data3 = 2; 

const int pe_error = 4; 

const int source_error = 5; 

const int no_majority = 6; 

const int bad_message = 7; 

// Error states of each component 
const int fault_free = 0; 
const int benign_f aulty = 1; 
const int symmetric_f aulty = 2; 
const int asymmetric_faulty = 3; 

// Transition probabilities for BIU error model 
const double biu_benign = 0.1; 
const double biu_symmetric = 0.1; 
const double biu_asymmetric = 0.1; 

formula biu_nofault = (1 - biu_benign - biu_symmetric - biu_asymmetric) ; 

// Transition probabilities for PE error model 
const double pe_benign = 0.1; 
const double pe_symmetric = 0.1; 

formula pe_nofault = (1 - pe_benign - pe_symmetric) ; 

// Transition probabilities for RMU error model 
const double rmu_benign = 0.1; 
const double rmu_symmetric = 0.1; 
const double rmu_a symmetric = 0.1; 

formula rmu_nofault = (1 - rmu_benign - rmu_symmetric - rmu_a symmetric) ; 

// Global variables = communication links 
// each link is an integer variable 


global 

pel_biul : 

[0. . 

.7] 

init 

0; 

global 

pe2_biu2 : 

[0. . 

.7] 

init 

0; 

global 

pe3_biu3 : 

[0. . 

.7] 

init 

0; 

global 

biul_rmul : 

[ 0 . 

. .7] 

init 

0; 

global 

biul_rmu2 : 

[ 0 . 

. .7] 

init 

0; 

global 

biul_rmu3 : 

[ 0 . 

. .7] 

init 

0; 

global 

biu2_rmul : 

[ 0 . 

. .7] 

init 

0; 

global 

biu2_rmu2 : 

[ 0 . 

. .7] 

init 

0; 
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global biu2_rmu3 : [0..7] init 0; 

global biu3_rmul : [0..7] init 0; 

global biu3_rmu2 : [0..7] init 0; 

global biu3_rmu3: [0..7] init 0; 

// source data: what PEI will broadcast 
// it must be valid data 
// 

// NOTE: prism does not allow me to specify that source must 
// be equal to datal, data2, or data3 here, 
const int source = data3; 


// Majority vote for BIU1 when it receives three messages 
// in biul_rmul, biul_rmu2, biul_rmu3 
formula votel = 

(biul_rmul=biul_rmu2 ) ? ( (biul_rmul ! =bad_message) ? biul_rmul : biul_rmu3) 
: (biul_rmu2=biul_rmu3 ) ? ( (biul_rmu2 ! =bad_message) ? biul_rmu2 : biul_rmul) 
: no_majority; 


// Same thing for BIU2 
formula vote2 = 

(biu2_rmul=biu2_rmu2 ) 
: (biu2_rmu2=biu2_rmu3 ) 
: no_majority; 


? ( (biu2_rmul ! =bad_message) ? biu2_rmul 
? ( (biu2_rmu2 ! =bad_message) ? biu2_rmu2 


biu2_rmu3 ) 
biu2_rmul ) 


// ERROR MODEL FOR PEI 
module pel_error 

pel_faulted: bool init false; 
pel_error: [0..2] init fault_free; 

[] ( ! pel_f aulted) -> 

pe_benign: (pel_error' =benign_faulty ) & (pel_f aulted' =true) 

+ pe_symmetric : (pel_error' =symmetric_f aulty ) & (pel_f aulted' =true) 

+ pe_nofault: (pel_error' =f ault_f ree ) & (pel_f aulted' =true) ; 


endmodule 


// PEI 
module pel 

pel_ctrl: [0..4] init 0; 

pel_out: [0..7] init 0; 


// broadcast 
[] (pel_ctrl=0) & 

[] (pel_ctrl=0) & 

[] (pel_ctrl=0) & 

[] (pel_ctrl=0) & 

[] (pel_ctrl=0) & 


(pel_error=fault_f ree) -> (pel_ctrl' =1) & 

(pel_biul' = source) ; 

(pel_error=benign_f aulty ) -> (pel_ctrl ' =1 ) & 

(pel_biul ' =bad_message) ; 

(pel_error=symmetric_f aulty ) -> (pel_ctrl ' =1 ) & 

(pel_biul ' =datal ) ; 

(pel_error=symmetric_f aulty ) -> (pel_ctrl ' =1 ) & 

(pel_biul ' =data2 ) ; 

(pel_error=symmetric_f aulty) -> (pel_ctrl ' =1 ) & 
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(pel_biul ' =data3) ; 


[pel_broadcast ] (pel_ctrl=l) -> (pel_ctrl' =2 ) ; 

// wait for message from BIU1 
[biul_send] (pel_ctrl=2) -> (pel_ctrl ' =3 ) ; 

// consume the message: copy it into pel_out 
[] (pel_ctrl=3) -> (pel_ctrl ' =4 ) & (pel_out' = pel_biul) ; 

endmodule 


// ERROR MODEL FOR BIU1 
module biul_error 

biul_f aulted : bool init false; 
biul_error: [0..3] init fault_free; 


[] 


(! biul_f aulted) -> 
biu_benign : 

+ biu_symmetric : 

+ biu_asymmetric : 

+ biu_nofault: 


(biul_error' = benign_f aulty ) & (biul_f aulted' =true) 
(biul_error' = symmetric_f aulty ) & 

(biul_f aulted' =true) 

(biul_error' = asymmetric_f aulty ) & 

(biul_f aulted' =true) 

(biul_error' = fault_free) & (biul_f aulted' =true) ; 


endmodule 


// BIU1 
module biul 

biul_ctrl: [0..6] init 0; 

// wait for broadcast from pel 

[pel_broadcast ] (biul_ctrl=0 ) -> (biul_ctrl' =1 ) ; 

// Fault-free case: consume the message and forward it to all rmus 
[] (biul_ctrl=l ) & (biul_error=f ault_f ree) & (pel_biul ! =bad_message) -> 

(biul_ctrl' =2) & (biul_rmul ' =pel_biul ) & 

(biul_rmu2 ' =pel_biul ) & (biul_rmu3 ' =pel_biul ) ; 

[] (biul_ctrl=l ) & (biul_error=f ault_f ree) & (pel_biul=bad_message) -> 

(biul_ctrl' =2) & (biul_rmul ' =pe_error ) & 

(biul_rmu2' =pe_error) & (biul_rmu3 ' =pe_error ) ; 

// Benign fault: send manifest bad message to all rmus 
[] (biul_ctrl=l ) & (biul_error=benign_f aulty) -> 

(biul_ctrl' =2) & (biul_rmul' =bad_message) & 

(biul_rmu2 ' =bad_message ) & (biul_rmu3' =bad_message) ; 

// Symmetric fault: send the same arbitrary message to all rmus 
[] (biul_ctrl=l ) & (biul_error=symmetric_f aulty ) -> (biul_ctrl ' =2 ) & 

(biul_rmul ' =datal ) & (biul_rmu2 ' =datal ) & (biul_rmu3' =datal) ; 
[] (biul_ctrl=l ) & (biul_error=symmetric_f aulty ) -> (biul_ctrl ' =2 ) & 

(biul_rmul ' =data2 ) & (biul_rmu2 ' =data2 ) & (biul_rmu3' =data2 ) ; 
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[] (biul_ctrl=l) & (biul_error=symmetric_f aulty) -> (biul_ctrl ' =2 ) & 

(biul_rmul ' =data3 ) & (biul_rmu2 ' =data3 ) & (biul_rmu3' =data3) ; 

// Asymmetric fault: send different messages to the rmus 
// TODO: add more combinations 

[] (biul_ctrl=l ) & (biul_error=asymmetric_f aulty ) -> (biul_ctrl' =2) & 

(biul_rmul ' =datal ) & (biul_rmu2 ' =data2 ) & (biul_rmu3' =data3) ; 

[] (biul_ctrl=l ) & (biul_error=asymmetric_f aulty ) -> (biul_ctrl' =2) & 

(biul_rmul ' =datal ) & (biul_rmu2 ' =data2 ) & (biul_rmu3' =data2 ) ; 

[] (biul_ctrl=l ) & (biul_error=asymmetric_f aulty ) -> (biul_ctrl' =2) & 

(biul_rmul ' =datal ) & (biul_rmu2 ' =datal ) & (biul_rmu3' =data2 ) ; 

[] (biul_ctrl=l) & (biul_error=asymmetric_f aulty ) -> 

(biul_ctrl'=2) & (biul_rmul' =bad_message) & 

(biul_rmu2 ' =datal ) & (biul_rmu3' =data3) ; 

[] (biul_ctrl=l) & (biul_error=asymmetric_f aulty ) -> (biul_ctrl' =2) & 

(biul_rmul' =pe_error) & (biul_rmu2 ' =datal ) & (biul_rmu3 ' =data2 ) ; 

// Synchronization: allow RMUs to read the message 
[biul_broadcast ] (biul_ctrl=2 ) > (biul_ctrl' =3) ; 

// wait for replay from all RMUs 
[rmu_relay] (biul_ctrl=3) -> (biul_ctrl ' =4 ) ; 

// consume the three messages from the RMU and compute the 
// majority 

[] (biul_ctrl=4 ) & (biul_error=fault_f ree) -> 

(biul_ctrl ' =5 ) & (pel_biul ' =votel ) ; 

[] (biul_ctrl=4 ) & (biul_error=benign_f aulty) -> 

(biul_ctrl ' =5 ) & (pel_biul' =bad_message) ; 

[] (biul_ctrl=4 ) & (biul_error=symmetric_f aulty ) -> 

(biul_ctrl ' =5 ) & (pel_biul ' =datal ) ; 

[] (biul_ctrl=4 ) & (biul_error=symmetric_f aulty ) -> 

(biul_ctrl ' =5 ) & (pel_biul ' =data2 ) ; 

[] (biul_ctrl=4 ) & (biul_error=symmetric_f aulty) -> 

(biul_ctrl ' =5 ) & (pel_biul ' =data3 ) ; 

[] (biul_ctrl=4 ) & (biul_error=symmetric_f aulty ) -> 

(biul_ctrl ' =5 ) & (pel_biul ' =pe_error ) ; 

[] (biul_ctrl=4 ) & (biul_error=symmetric_f aulty ) -> 

(biul_ctrl ' =5 ) & (pel_biul ' =source_error ) ; 

[] (biul_ctrl=4 ) & (biul_error=symmetric_f aulty ) -> 

(biul_ctrl ' =5 ) & (pel_biul ' =no_ma jority ) ; 

[] (biul_ctrl=4 ) & (biul_error=asymmetric_f aulty ) -> 

(biul_ctrl ' =5 ) & (pel_biul ' =datal ) ; 

[] (biul_ctrl=4 ) & (biul_error=asymmetric_f aulty ) -> 

(biul_ctrl ' =5 ) & (pel_biul ' =data2 ) ; 

[] (biul_ctrl=4 ) & (biul_error=asymmetric_f aulty ) -> 

(biul_ctrl ' =5 ) & (pel_biul ' =data3 ) ; 

[] (biul_ctrl=4 ) & (biul_error=asymmetric_f aulty ) -> 

(biul_ctrl ' =5 ) & (pel_biul ' =pe_error ) ; 

[] (biul_ctrl=4 ) & (biul_error=asymmetric_f aulty ) -> 

(biul_ctrl ' =5 ) & (pel_biul ' =source_error ) ; 
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[] (biul_ctrl=4 ) & (biul_error=asymmetric_f aulty ) -> 

(biul_ctrl ' =5 ) & (pel_biul ' =no_ma jority ) ; 

[biul_send] (biul_ctrl=5 ) -> (biul_ctrl ' =6 ) ; 
endmodule 


// RMU1 
module rmul 

rmul_ctrl: [0..3] init 0; 

// Wait for data from BIU1 

[biul_broadcast ] (rmul_ctrl=0) -> (rmul_ctrl' =1) ; 

// consume the message and forward it to all BIUs 
[] (rmul_ctrl=l) -> (rmul_ctrl' =2) & (biul_rmul ' =biul_rmul ) & 

(biu2_rmul ' =biul_rmul ) & (biu3_rmul ' =biul_rmul ) ; 
[rmu_relay] (rmul_ctrl=2) -> (rmul_ctrl' =3) ; 

// done 

// [] (rmul_ctrl=3) -> (rmul_ctrl' =3) ; 

endmodule 


// RMU2 and RMU3 are like RMU1, modulo renaming 

module rmu2 = rmul [rmul_ctrl=rmu2_ctrl,biul_rmul=biul_rmu2, 

biu2_rmul=biu2_rmu2 , biu3_rmul=biu3_rmu2 ] 


endmodule 


module rmu3 
endmodule 


rmul [rmul_ctrl=rmu3_ctrl, biul_rmul=biul_rmu3 , 
biu2_rmul=biu2_rmu3 , biu3_rmul=biu3_rmu3 ] 


// BIU2 
module biu2 

biu2_ctrl : [ 0 . . 3 ] init 0; 

// Wait for data from RMUS 

[rmu_relay] (biu2_ctrl=0 ) -> (biu2_ctrl ' =1 ) ; 

// consume the messages + vote 

[] (biu2_ctrl=l ) -> (biu2_ctrl ' =2 ) & (pe2_biu2 ' =vote2 ) ; 

[biu2_send] (biu2_ctrl=2 ) -> (biu2_ctrl ' =3 ) ; 

// done 

// [] (biu2_ctrl=3 ) -> (biu2_ctrl' =3) ; 
endmodule 

// BIU3: same as BIU2 modulo renaming 

module biu3 = biu2 [biu2_ctrl=biu3_ctrl , biu2_send=biu3_send, 

biu2_rmul=biu3_rmul , biu2_rmu2=biu3_rmu2 , 
biu2_rmu3=biu3_rmu3 , pe2_biu2=pe3_biu3 ] 

endmodule 
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// PE2 
module pe2 

pe2_ctrl: [0..2] init 0; 

pe2_out: [0..7] init 0; 

// wait for message from BIU2 
[biu2_send] (pe2_ctrl=0) -> (pe2_ctrl ' =1 ) ; 

// consume the message 

[] (pe2_ctrl=l) -> (pe2_ctrl ' =2 ) & (pe2_out' = pe2_biu2) ; 

// done 

// [] (pe2_ctrl=2) -> (pe2_ctrl' =2) ; 

endmodule 

// PE3: same as PE2 modulo renaming 

module p3 = pe2 [pe2_ctrl=pe3_ctrl, pe2_out=pe3_out, 

biu2_send=biu3_send, pe2_biu2=pe3_biu3 ] 

endmodule 


label "done" = pel_ctrl=4 & pe2_ctrl=2 & pe3_ctrl=2; 

label "all_good" = (pel_out=source ) & (pe2_out=source) & (pe3_out=source) ; 

label "unanimous" = (pel_out=pe2_out ) & (pel_out=pe3_out ) ; 
label "nonfaulty_agree" = (pe2_out=pe3_out ) ; 

label "nonf aulty_valid" = (pe2_out=source) & (pe3_out=source) ; 


Properties for this model 

// The broadcast protocol terminates with probability 1. 
"init"=>P>=l [ F "done" ] 

P>=1 [ G ( "done"=>"all_good" ) ] 

P>=1 [ G ( " done " => " unanimous " ) ] 


Pmin =? 

[ G ( 

"done" 

= > 

"unanimous") ] 

Pmax =? 

[G ( " 

done" 

= > " 

unanimous " ) ] 

Pmin =? 

[G ( 

"done" 

= > 

"nonf aulty_agree" 

Pmax =? 

[G ( 

"done" 

= > 

"nonf aulty_agree" 

Pmin =? 

[G ( 

"done" 

= > 

"nonf aulty_valid" 

Pmax =? 

[G ( 

"done" 

= > 

"nonf aulty_valid" 
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